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Challenging research in various fields has driven a wide range of 
methodological advances in variable selection for regression models 
with high-dimensional predictors. In comparison, selection of non- 
linear functions in models with additive predictors has been consid- 
ered only more recently. Several competing suggestions have been 
developed at about the same time and often do not refer to each 
other. This article provides a state-of-the-art review on function 
selection, focusing on penalized likelihood and Bayesian concepts, 
relating various approaches to each other in a unified framework. 
In an empirical comparison, also including boosting, we evaluate 
several methods through applications to simulated and real data, 
thereby providing some guidance on their performance in practice. 
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1 Introduction 



Challenging substantive research questions and technological innovations, be- 
ginning in molecular life sciences, stimulated a revival of methodological and 
applied research for selection of influential components in regression models. 
There exists now a vast literature on variable selection in linear regression mod- 
els 



yi = (^0 + PiXii + . .. + (3pXip + ei 



1, . . . ,n. 



(1) 



as well as in generalized linear models or hazard rate models, with high-dimensional 
linear predictor r/|'°, where p is (very) large compared to the sample size n. 

In comparison, corresponding research on function selection in additive mod- 
els 



/3o + fi{xii) + ... + fp{xip) + ei 



(2) 



is more recent and still comparably sparse, especially so for non-Gaussian models 
with additive predictors r]f'^'^, possibly including further components such as 
linear effects and interaction terms. 

The following related issues are of interest in function selection: Should a 
function fj{xj) be included in the predictor at all? Because functions are usu- 



ally centered around zero, we therefore want to know whether 



differs 



significantly from zero. Moreover, we are often interested whether the effect is 
linear, that is fj{xj) = f3jXj or nonlinear, that is, does fj{xj) differ significantly 
from a straight line? Is the approximation of the regression function through an 
additive predictor good enough or do we have to incorporate varying coefficient 
terms, such as /i(xi)x2, or interactions such as /i2(3:^i) 2:2)? 

The aims of this article are as follows: We provide a state-of-the-art survey on 
recent advances in function selection. Our focus is on penalized least squares or 
likelihood methods and on Bayesian concepts, in particular based on selection 
indicators and spike- and-slab priors. We sketch - mostly asymptotic - results 
on model selection consistency, convergence rates or oracle properties, as far as 
they are available. To explore and to illustrate performance in finite sample 
situations, we compare and evaluate several methods through application to 
simulated and real benchmark data. Naturally, usable software implementation 
has been a key criterion for including a selection method into this empirical 
evaluation. This is the first empirical comparison of several recent methods for 
function selection and provides guidance on their performance in practice. 

Most frequentist and Bayesian approaches for function selection are based on 
or motivated from variable selection in regression models with high-dimensional 
linear predictors as in ([I]). Very popular penalization methods for variable 



selection are the Lasso Tibshirani 19961, the SCAD penalty iFan and Li 2001 



and modifications such as the adaptive Lasso iZou 2006 , the groupLasso lYuan 



and Lin 2006 and the groupSCAD Wang et al. 



2007 



Another branch is 



boosting [Biihlmann and Yu 2003 Biihlmann and Hothorn 2007 , which we 



include as a further competitor in our empirical analyses since it is applicable to 
a wide range of regression models and is implemented in the R package m boost 



Hothorn et al. 2012 



Bayesian variable selection in linear predictors is often based on spike-and-slab 
priors for single regression coefficients and has been most extensively discussed 
for the classical linear model ([T]), see George and McCuUoch 1993 1997 . The 
basic idea is that each coefficient /3i is modeled as having come either from a 
distribution with most (or all) of its mass concentrated around zero (the 'spike'), 
or from a comparably diffuse distribution (the 'slab'). A closely related concept 
is to introduce selection indicator variables for coefficients being zero or non- 

Instead of placing spike-and-slab priors directly on 
assume a spike-and-slab prior 



zero 



Smith and Kohn, 1996 



regression coefficients, Ishwaran and Rao 12005 



for the variance of (conditionally) Gaussian priors. This concept is conveniently 
extended to generalized linear models, see Fahrmeir et al. [2010 



review paper. 



O'Hara and Sillanpaa 2009 



In a recent 
compare several Bayesian variable 

Section 



2011 



selection methods for linear models; see also Fahrmeir and Kneib 
4.5.3]. 

The common feature of penalization methods for function selection is to place 
a penalty on certain norms of functional components or associated blocks of basis 
function coefficients. The penalty term usually enforces sparseness or smooth- 
ness of regression functions, but both properties are also related to each other 
through the choice of the function space. Lin and Zhang 12006 proposed the 



component selection and smoothing operator (COSSO) in additive smoothing 



spline ANOVA models, with extensions to exponential family responses in Zhang 



and Lin 2006 and to hazard regression in Leng and Zhang |2006| . Ravikumar 



et al. 



2009 enforce sparsity in additive models by imposing a penalty on the 

incorporate 



empirical L2-norm of functional components. Meier et al. 2009 



an additional smoothness penalty and Radchenko and James 20101 impose a 
further L2-norm penalty to obtain sparse hierarchical structures in models with 
two-way functional interactions. All these methods include only one global 
penalty parameter to enforce sparseness of main effects, or a second one to 
control the amount of smoothness of main effects or hierarchical structures in 
two-way interaction models. This leads to a tendency to oversmooth nonzero 
functional components in order to set unimportant functions to zero. Motivated 
by the adaptive groupLasso, Storlie et al. [2011 propose the adaptive COSSO 
(ACOSSO) to penalize each functional component in additive models differently 
so that more flexibility is obtained for functional components with more curva- 



ture while shrinking unimportant components more heavily to zero. Zhang et al. 



|2011b| extend the adaptive COSSO by decomposing nonlinear functions into a 
linear part and a nonlinear deviation. A similar adaptive groupLasso approach 



is studied in Huang et al. 2010 . Instead of penalizing directly the norms of ba- 



sis function coefficient vectors, leading to groupLasso-type penalties, Xue 12009 



for additive models] and Wang et al. 2007 for varying coefficient models], insert 
norms into the SCAD penalty, leading to groupSCAD-type penalties. 

Belitz and LanglpOOS ] propose a simple and computationally efficient method 



for simultaneous estimation and selection in (generalized) additive models based 
on the quadratic P-spline penalty in combination with an extended backfitting 
algorithm. In a double penalty approach, Marra and Wood [2011 decompose 
the quadratic smoothing spline penalty for generalized additive models into two 
terms corresponding to the null space, e.g. linear functions, and the penalty 



cubic spline deviations from linear functions, shrinking both 
A related idea has been suggested in Avalos et al. 2007 for 



range space, e.g. 
of them to zero. 

additive models, shrinking linear functions to zero in Lasso-type fashion. 

While most penalization methods have been primarily developed for additive 
models, boosting approaches are also available for more general semiparametric 
predictors and non-Gaussian models, see e.g. Tutz and Binder [2006 and Kneib 



et al. 2009 



Bayesian function selection is mostly based on introducing spike-and-slab pri- 
ors with a point mass at zero for blocks of basis function coefficients or, equiv- 
alently, indicator variables for functions being zero or nonzero. 
(2003 



Wood et al. 



mKm and Yau et al 



describe implementations using a data-based prior 
that requires two MCMC runs, a pilot run to obtain a data-based prior for the 
"slab" part and a second one to estimate parameters and select model compo- 
nents. Panagiotelis and Smith [2008 combine this indicator variable approach 
with partially improper Gaussian priors, as for basis coefficients of Bayesian 
P-splines, in high-dimensional additive models. They suggest several sampling 
schemes that dominate the scheme in Yau et al. 2003 . A more general approach 



based on double exponential regression models that also allows for flexible mod- 
eling of the dispersion is described by Cottet et al. 12008 . They use a reduced 



rank representation of cubic smoothing splines with a very small number of 
basis functions to model the smooth terms in order to reduce the complexity 
of the fitted models, and, presumably, to avoid mixing problems. [Reich et al. 
|2009| consider a Gaussian functional ANOVA model with (conditionally) con- 
jugate Gaussian process priors for functions. To perform function selection, 
they impose a spike-and-slab prior on variances of Gaussian processes, with a 
point mass at zero for the spike and a half-Cauchy prior for the slab. It seems 
difficult, however, to extend it to non-Gaussian regression models. 

A novel spike-and-slab prior structure for function selection in the broad class 
of structured additive regression models for Gaussian and discrete responses is 
proposed in Scheipl 2011a , Scheipl et al. 2012 and implemented in the R 



package spikeSlabGAM [Scheipl 2011b 



2 Penalized least squares and likelihood-based function 
selection 



2.1 General concept 

Let y = (ui, . . . ,yn)' denote the vector of all responses, rj = (rji, . . . ,r]n)' the 
vector of corresponding predictor values, and fj = {fj{xij),...,fj{xnj)) the 
vector of function evaluations of a function fj{xj). All penalization methods 
measure the fit of a regression model by some loss function loss(y, ij) or equiv- 
alently through a utility function utility(?/, ry) = —loss{y,T]). 
The L2-I0SS 

loss{y, rj) = \\y - r]\\'^ 



i.e., the squared Euclidian distance between y and r/, is commonly used for 
additive models y = ri^'^^ -\- e and extensions including a linear predictor rj^™ 
or interaction terms. Sometimes the empirical L2-I0SS H?/ — ^711^/^ is considered 
as a slight modification. For Gaussian i.i.d. errors the L2-I0SS coincides with 
the negative log-likelihood, up to an additive constant. The L2-I0SS is also 
considered for non-Gaussian errors with distributions with similar shapes, but 
then it is only a (negative) quasi-log-likelihood. To improve efficiency, one may 
consider a weighted L2-I0SS, with weights proportional to the inverse standard 
deviations. For non-Gaussian responses with exponential family distributions, 
in particular binary, categorical or count responses, the negative log-likelihood 

loss(2/,77) = -l(y,??) 

is a more natural choice. For Cox-type hazard regression models or quasi- 
hkehhood models the loss is defined by the (negative) partial log-likelihood or 
some pseudo-log-likelihood. 

To regularize estimation of high-dimensional regression models and to en- 
force sparseness through component selection and/or smoothness of functions, 
a penalty function pen(?7) is introduced, including one or more tuning param- 
eters A to control the amount of penalization. For models with an additive 
predictor ([2|, the penalty is a sum of penalties for each function 

p 

Pen(r/) = ^penj(/j), 

j=i 

where each penalty term penj(fj) includes one (or two) tuning parameter com- 
mon to all of them, or separate, in case of tensor product terms possibly multi- 
dimensional, tuning parameters Xj for each of them. We distinguish between 
sparseness and smoothness penalties. Sparseness penalties are constructed to 
shrink (suitable norms of) functions to zero, deleting "unimportant" functions. 
Smoothness is - hopefully - obtained as a by-product, e.g. through the choice of 
a suitable space of smooth functions. Smoothness penalties explicitly encourage 
the fitting of smooth functions. Modifications of well-known smoothing spline 
penalized or regression spline approaches are required, however, to encourage 



sparseness of predictors. Section 2.2 describes sparseness penalties in more de- 
tail, while Section 2.3 deals with smoothness penalties or a combination of both 
types. 

The ultimate goal is to (simultaneously) estimate and select functions / = 
(/i, . . . , fj , . . .) by minimizing the penalized loss 

loss(^/, 77) -|- pen(T7) — t- min 

where is a prespecified suitable function space. Tuning parameters are as- 
sumed to be fixed or known in this minimization problem and are estimated 
by minimizing additional criteria, in particular cross-validation, information 
criteria such as (modifications of) AIC, BIG, and Mallows Gp, or (restricted) 
marginal likelihood (REML). 



In almost all approaches considered in Sections 2.2 and 2.3 the function 
space Tj for each function fj is a linear space defined through basis functions 
Bjk{x), k = 1,2, . . ., such as the space of polynomial splines equipped with a 
B-spline basis. Then 

/i(a;) = 2]/3ifci3,fc(x) (3) 

k 

and the vector fj of function evaluations can be expressed as 

/, = X,f3j (4) 

with basis function values Bjk{xij) as elements of the design matrix Xj. Then 
the penalized loss can be expressed as 

loss(t/,/3) + pen(/3) — )• nnn, 

where (3 is the vector of all basis function coefficients. For an additive model 
^ and an L2-I0SS, we obtain the penalized least squares criterion 



PLS(/3) = \\y- l/3o - VX,/3,||' + Vpen^(/3,) ^ min. (5) 

/9o,/3 

with 1 = (1,...,!)' denoting a vector of ones. To guarantee identifiability, 
appropriate constraints have to be introduced. In particular, we have to as- 
sume non-concurvity of the functions and ensure that each function estimate is 
centered around zero. 

Many function selection penalties are based on or related to the groupLasso. 
Yuan and Lin |2006| introduced it for linear models as an extension of the 



Lasso for selecting groups of coefficients (3j associated with factor variables or 
a polynomial function of a continuous covariate. The groupLasso estimator is 
defined as the solution to 

||y-l/3o-X;^.A-|l' + '^Ell/3.ll^. ^i"^^- (6) 

where the design matrix Xj corresponds to the jth factor. After centering 
covariates and responses, the unpenalized intercept term may be omitted for 
simplicity. The norms in the groupLasso penalty are defined by 



ii/3,iiK, = (/3;K,a 



\l/2 



where Kj is a dj x dj positive (semi-)definite matrix Kj, dj = dim(/3j). For 
dj = 1, j = 1, . . . ,p, the groupLasso reduces to the usual Lasso. For Kj = I^j, 
the norm reduces to the Euclidian norm ||/3j||. To scale for different dimensions 
of the coefficient vectors, Kj = djl^j may be chosen, so that reduces to 

■^/clj 1 1/3 jl |. Yuan and Lin 2006 discuss shrinkage properties of the groupLasso 
and suggest a coordinate- wise algorithm for estimating the coefficients. Solution 



paths look similar as for the Lasso: Depending on the shrinkage parameter A, 
(3j may be exactly zero and the corresponding group of variables is removed 
from the model. 

Conceptually, the groupLasso is easily extended to high-dimensional general- 
ized linear models: The L2-I0SS has to be replaced by the negative log- likelihood 
of a generalized linear model, e.g. a logit model, see Meier et al. 20081. How- 



ever, parameter estimation becomes more demanding. An implementation can 



be found in the R-package grplasso Meier 



2009 



In analogy, the SCAD penalty of Fan and Li 2001 can be extended to a 



groupSCAD penalty by inserting some norm ||/3j||i<-j into the (scalar) argument 
of the SCAD penalty. This seems promising, at least from an asymptotic point 
of view, because the SCAD selector possesses the oracle property, in contrast to 
the Lasso selector. 



2.2 Sparseness penalties 



Ravikumar et al. 20091 develop function selection for sparse additive models 



based on the sparseness penalty 

pen(/,) = A(i/;./,y/^ = A||/,.|U 

where ||/j||n is the empirical L2-norm, and the functions are assumed to be 
centered around zero. Thus, the norm ||/j||n penalizes deviations from zero and 
shrinks function to zero. The shrinkage parameter A is the same for all functions. 
To minimize the PLS criterion, they propose a "sparse" backfitting algorithm 
(SPAM) which works for any smoother, e.g. also for local linear or kernel 
smoothers. Thus, smoothness is considered implicitly. Their approach and 
parts of the algorithm can be seen as a "functional" version of the groupLasso. 
This becomes evident if functions are represented through basis functions. Then 
= Xj(3j and the empirical L2-norm becomes 

pen(/3,) = A||/3,||K,, K, = ^x'^X, (7) 
which is a groupLasso penalty. 



Radchenko and James 2010 extend the sparse additive model by incorporat- 
ing two-way interaction terms fji. They suggest the penalty function 

pen(r,) = A,X:(||/,||^+ ± | I + A2 £ £ 

j=i i=j+i j=i i=j+i 

where 1 1 • 1 1 is the usual L2-norm. The first term penalizes all model terms, while 
the second term additionally discourages interaction terms to enter the model 
in order to support the hierarchical structure of main and interaction effects. 
Huang et al.||2010] propose an adaptive groupLasso for additive models, with 



functions fj{xj) represented through B-splines. After centering the B-splines 
Bjk{x) by redefining 



1 " 

Bjk{x) := Bjk - Bjk, Bjk{x) = - ^ Bjk{. 



n 
1= 



(and centering responses around their mean) , they suggest to estimate (3j through 
a PLS criterion with the adaptive groupLasso penalty 



(8) 



While the shrinkage parameter A is estimated through minimizing the BIC, 
the weights wj are determined in a first step via the groupLasso estimator, in 
analogy to the adaptive Lasso of Zou 2006) for linear models. Thus, another 
shrinkage parameter has to be estimated in this step. Because the adaptive 
Lasso possesses the oracle property, one may expect desirable properties for the 



adaptive groupLasso (see Section 2.4). 

Xue] [20091 and |Wang et al.] |200 7| suggest a grouped version of the SCAD 



penalty for function selection in additive and varying coefficient models respec- 
tively. Functions fj{xj) are again represented through a B-spline expansion. 
Xue| |2009| defines the groupSCAD penalty through 



pen(/3-) =px{\\(3A\k, 



where Kj = XjXj/n as in ([7|, while Wang et al. 2007| insert the Euclidian 



norm \\/3j\\ into the SCAD penalty of Fan and Li 



2001 



It is given by 



pxiw) 



'X\w\ 

-{w"^ -2aX\w\ +A2) 

(a+l)A^ 



if < A 

if A < \w\ < aX 
if \w\ > aX, 



which is a quadratic spline with knots at A and aX, with a = 3.7 as a stan- 
dard option. Because the SCAD selector in linear models possesses the oracle 
property, one may again expect desirable asymptotic properties for groupSCAD 



selectors (see Section 2.4). 



2.3 Smoothness penalties 

Smoothness penalties are well-known from representing and fitting smooth func- 
tions fj{xj) in additive and generalized additive models with predictors 

P 

through smoothing splines or penalized regression splines. Expressing splines 
by their basis function representations fj = Xj(3j in estimates fj and 0j 
are obtained as the minimizers of 

p 

\oss{y,P) +'^penj{f3j) 
where 

pen^.(/3,) = A,/3;.K,/3, = X,\\f3j\\],^. (9) 



The loss function is the L2-norm for AMs and the negative exponential family 
log- likelihood for GAMs, and Kj are the penalty matrices of smoothing splines 
or P-splines measuring roughness of the fitted functions. Although the smooth- 
ness penalties ([9| look similar to sparseness penalties, there are three important 
differences. First, the penalty matrices Kj are only positive semidefinite and 
the null space corresponds to functions which are not penalized. For the most 
common case of cubic splines, the null space consists of linear functions. This 
implies that functions will not be shrunk towards zero without some modifica- 
tion. Second, instead of the (semi-)norm the quadratic norm is used. 
Thus, (3j is a Ridge-type but not a Lasso- type estimator. Third, the penalty 
terms have separate smoothness parameters \j to allow for different degrees of 
smoothness of the various functions fj{xj), j = 1, . . . ,p. In contrast, all spar- 
sity penalties have one (or two) common shrinkage parameters, leading to a 
tendency to oversmooth functions. 

To circumvent the difficulty that smooth functions will in general not be 



estimated as zero even if they are nuisance functions, Marra and Wood 2011 
impose an additional penalty on the null space. Let 

be the spectral decomposition of the smoothing matrix, with zero eigenvalues in 
Aj corresponding to the null space of Kj . Then an extra penalty can be defined 
based on K* = Uj{Uj) , where U* is the matrix of eigenvectors corresponding 
to the zero eigenvalues of Aj. This allows to impose a double penalty 

XjP'jKjfBj + X*j(3'jK*Pj 

on each component fj. By estimating Aj and A* separately, a function fj can 
now be completely removed from the predictor. For the case of cubic smoothing 
splines, the first term penalizes deviations from a straight line and the second 
term penalizes straight but non-horizontal lines. This double penalty approach 



is implemented for GAMs in the R package mgcv Wood 2012 



A related idea has been considered by Avalos et al. 12007 for additive models. 



They impose a Lasso penalty on the coefficients of linear terms. 



Belitz and Lang 2008) develop a computationally very efficient modification of 



backfitting for simultaneous component selection and model estimation. Their 
approach applies to the very broad class of structured additive regression, with 
continuous and discrete response as in generalized linear models. The additive 
predictor can include linear terms, smooth functions as well as interactions, 
spatial effects for geoadditive regression, and random effect terms for clustered 
or longitudinal data. Univariate functions are modelled through P-splines, two- 
way interactions through corresponding penalized tensor product-splines, spatial 
components through Gaussian Markov random fields, and (uncorrelated) ran- 
dom effects are assumed to be i.i.d. Gaussian. Each of these different types of 
"functions" fj is associated with a quadratic penalty 



pen(/3j) = XjWPjWK^, 



where Kj depends on the different types of functions. For univariate smooth 
functions, Kj is the usual regression sphne penalty matrix, a Markov random 
field precision matrix for spatial components, and an inverse correlation matrix 
for Gaussian random effects. Estimation and selection are carried out by modi- 
fied backfitting for each of the components: At each step, a number of alternative 
smoothers is computed for a grid of smoothing parameters, determined through 
equivalent degrees of freedom, as well as for the null function, which corresponds 
to removing the component. The "best" alternative, measured through a GCV 
or AIC score, is selected until the iterations stop at convergence. 

Comparing with sparseness penalties in the previous subsection, it seems nat- 
ural to replace quadratic norms ||/3j||xj. in ([9| by norms ||/3j||/<-j. Meier et al. 
|2009| consider additive models (|2]), where (centered) functions are represented 
through B-spline basis functions. They introduce a sparsity-smoothness penalty 



pen(/,) = Ai ||/,||2+A2 {fAx)fdx 



1/2 



(10) 



where the first term encourages sparsity, and the second, weighted by the (com- 
mon) smoothing parameter A2, measures smoothness through the usual cubic 
smoothing spline penalty. With the B-spline representation fj = XjfBj, one 
obtains 



pen(/3,) = Ai ||/3^.||j^^. +A2||/3,||^, 



1/2 



where Kj = XjXj/n as in ([T]) and K'j is the cubic smoothing spline penalty. 
As an alternative, they also suggest 



pen(/3j) = Ai||/3j-||kj + A2||/3j||j<r*, 
a weighted sum of the sparseness and the smoothness norm. 



(12) 



Lin and Zhang 2006 proposed the component selection and smoothing opera- 



tor (COSSO) within the framework of functional ANOVA models, decomposing 



a regression function f{xi, 



into the sum of main effects fj{xj), two- 



way interaction effects and higher-order interaction effects similar to classical 
ANOVA models. Estimation of the components can be based on the theory of 



reproducing Hilbert spaces (RKHS), see Wahba 19901 and Gu 2002 for thor- 



ough exposures. In practice, however, only main effects or two-way interaction 
models are considered. We focus on additive models ^ and assume that each 
function fj{xj), Xj G [a,b], lies in the second order Sobolev space 



S2 



f :/ and / are continuous, / exists and is 



square-integrable, i.e. / / (t)dt < 00 



Extensions to higher-order Sobolev spaces and ANOVA decompositions are con- 
sidered in Lin and Zhang |2006|. The norm of a function / G ^2 is defined as 



2 2b 21 1/2 

The penalty for each main effect is 
Pen(/,) = X\\fj\\, 

and the COSSO estimate in additive models ^ is the minimizer of the PLS 
criterion 



n I p \ ^ p 

i=i \ j=i J j=i 



(14) 



The COSSO penalty is also a sparsity-smoothness penalty, comparable to but 
different from the sparsity-smoothness penalty (10) of Meier et al. [2009 : The 
first term in (13) is the usual (squared) sparsity L2-penalty, while the second 
and third term are the squared penalties used in traditional spline smoothing. 
However, there is only one common tuning parameter A to control the amount 
of sparsity and smoothness. 

RKHS-theory shows that minimizers fj, j = 1, . . . ,p, of (14) can be repre- 
sented as 



k=l 

where Xkj, k = 1, . . . , n, are the observed values of the covariate xj, and K{s, t) 
is the "reproducing kernel" of 52, equipped with the norm (13). Its explicit form 
is 

K{s,t) = 1 + ki{s)ki{t) + k2{s)k2{t) - kids - t\), 
with ki{s) = s- 0.5, /C2(s) = [kfis) - l/12]/2, 
his) = [kKs] - kl{s)/2 + 7/240]/24, 



see 



Wahba 1990 eq. 10.2.4]. Thus fj{x) has a basis function representation 



(|3]), with basis functions Bjj^{x) = K{x,Xkj), and the function penalty A||/j|| 
defined in (13) can be determined based on this representation as a function of 



the regression coefficients to obtain a corresponding PLS criterion. 

The COSSO is extended to generalized additive (and interaction) models 
in Zhang and Lin 120061, replacing the L2-I0SS by the negative log- likelihood 



resulting from exponential family distributions, and to Cox-type hazard rate 
models 



X{t\xi, ... ,Xp) = Ao(t)exp(/i(xi) . . . fp{xp)) 



in Leng and Zhang] |2006| , replacing the L2-I0SS by the negative partial log- 
likelihood. 



Because of the relationship to the groupLasso, the COSSO will not possess 
the oracle property. The common tuning parameter A also leads to a tendency 
to oversmooth nonzero function in favor of sparsity. Storlie et al. 2011 sug- 



gest the adaptive COSSO (ACOSSO) by introducing weights into the COSSO 
penalty, resulting in the ACOSSO penalty 



^^WjWfjW or X"^Wj\\pj\\K,, 

in analogy to the adaptive groupLasso 
are determined in a first step as 



8| of [Huang et"aL] |2010| . The weights 



fj{x)dx 



where fj is the usual smoothing spline estimate of fj, and 7 > 3/4, e.g. 7=1. 
As Storlie et al. 2011 show, ACOSSO has favorable asymptotic properties. 



including the oracle property, and dominates COSSO in simulation studies. 
Zhang et al. |2011a extend ACOSSO by decomposing nonlinear functions in 
additive models into a linear effect and a nonlinear deviation, orthogonal to 
each other. This allows to decide if a significant effect is linear or nonlinear. 



2.4 Properties of selectors 

The following properties are desirable but not always met, even asymptotically. 
They are basically the same as for variable selection in models with linear pre- 
dictors. 



Model selection consistency 

All influential functions or components are selected and all non-influential nui- 
sance components are removed with probability tending to 1 when the sample 
size tends to infinity. In finite samples, this property cannot be met exactly, 
and goodness of selection is measured through false negative rates, i.e., rates at 
which influential components are not selected, and false positive rates, i.e., rates 
at which spurious nuisance components are selected. Alternatively, one may of 
course use true positive and negative rates. 

In models with sparse predictors, where the number of zero components is very 
large and may even grow asymptotically faster than the sample size, selection 
consistency is also called sparsistency. 

Although selection consistency or sparsistency are desirable, they are not al- 
ways met. A weaker form is what we call selection sub-consistency: Asymptoti- 
cally the true nonzero functions or components are only a subset of the selected 
components. 



Estimation consistency 



All functions or components are estimated consistently, that is, estimates fj 
converge (in probability) to the true nonzero or zero functions fj . If functions are 



represented through some (truncated) basis function expansion, corresponding 
groups of coefficients f3j are estimated consistently. 



Convergence rates and the oracle property 



Convergence rates measure the speed of convergence of consistent estimators. 
If convergence for simultaneous selection and estimation in sparse models is as 
fast as for estimation of the true (unknown) submodel that only contains the 
nonzero components, than the selection and estimation procedure has the oracle 
property. 

For models with high-dimensional linear predictors, the Lasso selects mod- 
els consistently under regularity assumptions but does not possess the oracle 
property. The adaptive Lasso and the SCAD selection procedure possess the 
oracle property. Therefore, one may expect similar properties for function se- 
lection procedures in models with additive predictors based on the (adaptive) 
groupLasso or groupSCAD. 

Rigorous proofs of asymptotic properties are only available for additive mod- 
els. All authors assume that functions fj are centered around zero, either by 
assuming 'E{fj{xj)) = for random covariates Xj, or by centering functions em- 
pirically, so that Y17=i fji^ij) ~ 0- -^o^ simplification, we may also assume that 
responses are centered around their mean, so that the intercept term /3q can be 
omitted. 

Most authors relax the conventional assumption of Gaussian i.i.d. errors 
For example Huang et al. |2010 assume i.i.d. 

2 



to a certain extent 

E(ej) = and Var(ej) = a'^, with tail probabilities satisfying P 
i^(exp —Cx"^) for some constants C > and K 



> 0. 



errors with 
e,;| > x) < 



Meier et al. 



2009 



and 



Storhe et alT] |2011| allow errors to be uniformly sub-Gaussian, that is there exist 
constants K > and C > such that 



sup max E(exp(ej/K)) < C. 

n i=l,...,n 



An important difference is whether the number q of nonzero functions fj ^ 
0, j = 1, . . . , q, and the number of zero functions fj = 0, j = q + 1, . . . ,p, are 
fixed or may grow with a certain rate with increasing sample size n. Lin and| 



Zhang 2006 



Storlie et al. 2011| and Xue |2009 | assume that q as well as p (and 

assume 



therefore the number p — (7 of zero functions) is fixed. Huang et al. [2010 



that q, the number of nonzero functions 7^ 0, is fixed, but p and therefore 



the number of zero functions may even grow faster than the sample size. Meier 



et al. 



2009 



consider the case where the numbers of zero and of nonzero /j's 
may both be larger than n. 

In many applications it seems realistic, that the number q of nonzero influen- 
tial functions is fixed while p may grow with n in sparse models. Therefore, we 
take a closer look at the assumptions and the results for the adaptive groupLasso 
selector of Huang et al. 2010 . They consider model selection and estimation 



in additive models, where each function fj belongs to a class T of smooth func- 
tions / on an interval [a, b], say, whose kth derivative /(*^^ exists and satisfies a 



Lipschitz condition of order a G (0, 1], i.e. 



The order of smoothness is defined as c? = + a. For example, the second order 
Sobolev space S2 is such a class T with d = 1 + 1 = 2. 

Beyond the assumptions on the errors ej already stated, they assume that the 
covariate vectors a;, = (xji, . . . , Xjp) , i = 1, . . . , n, are realizations of a random 
covariate vector x with continuous density, with marginal densities of Xj 
satisfying < Ci < gj[x) < C2 < 00 on [a, for every j = 1, . . . ,p. 

To select and estimate functions with the adaptive groupLasso, each function 
fj is approximated through a centered B-spline expansion 

m 
k=l 

where m also grows with n at an appropriate rate. Finally, the shrinkage pa- 
rameter Ao of the groupLasso estimator, needed to estimate the weights Wj of 
the adaptive groupLasso in a first step, as well as the shrinkage parameter A of 
the adaptive groupLasso penalty, have to increase with the sample size n, with 
rates defined in a number of theorems and corollaries on selection consistency 
and convergence rates of the adaptive (as well as the non-adaptive) groupLasso 
estimator. In particular, their Corollary 2 provides sufficient conditions which 
are easy to verify and guarantee that the adaptive groupLasso estimator using 
the groupLasso as the initial estimator is selection consistent and achieves opti- 
mal rate of convergence, that is it possesses the oracle property. More precisely. 
Corollary 2 of Huang et al. 2010 says: 



Let the groupLasso estimator with Aq ~ \/n log(pm) and m ~ j^i/2rf+i ]-,g 
the initial estimator in the adaptive groupLasso. Suppose that the number of 
nonzero components q is fixed and that the above assumptions on error and 
covariates hold. If the shrinkage parameter A < 0(n^/^) satisfies 

A , ^l/4d+2i l/2(p^) 

— o(l) and =o(lj. 



^(8d+3)/(8d+4) ^ ^ X 

a further condition restricting its admissible growth rate, then the adaptive 
groupLasso consistently selects the nonzero components, that is 

P(||/il|2>0, j = l,...,q, and ||/,||2 = 0, j = q + 1, . . . ,p) ^ 1. 

In addition. 

In particular we obtain the convergence rate n~'^^^ for d = 2, which is the 
optimal rate for functions with smoothness of order 2 if we had known that 
fj = for j = q + 1, . . . ,p in advance. Thus, the adaptive groupLasso is 
oracle-efficient. 



For d = 2, this result is also in agreement with the asymptotic properties 



of the ACOSSO [Storlie et al. 2011 , although the admissible function space 
is different (second order Sobolev space of periodic functions), the covariate 
values come from a tensor product design, and the penalty is different. For 
appropriate growth rates of the smoothing parameters, the ACOSSO selector 
also has optimal convergence rates and is selection consistent, implying the 
oracle property. 



Xue 2009 provides model selection consistency and optimal convergence rates 



for his groupSCAD selector under somewhat different but qualitatively compa- 
rable assumptions: The density function g{x) of the random covariable x is 
continuous and bounded away from zero on its (compact) support, the func- 
tions fj are d-times continuously differentiable, the polynomial spline estimators 
have degree d, and the number of interior knots goes to infinity with the order 

Huang et al. |2010| also show that the (non-adaptive) groupLasso is sub- 



consistent, that is the selected functions contain the subset of nonzero functions 
with probability tending to one, but the selected set will be larger than the 
set of nonzero functions. A similar result on sub-consistency is valid for the 



sparseness-smoothness penalty selector of Meier et al. 2009 



Asymptotic results as the ones sketched in this subsection provide valuable 
insight into the properties of selection and estimation procedures. There are, 
however, essential restrictions which may limit the usefulness of asymptotics of 
this type in applications: 

First, all asymptotic results assume a fixed sequence of penalty parameters, 
with appropriately chosen growth rates. In practice, however, penalty param- 
eters are chosen by data driven procedures such as cross-validation, AIC, BIG, 
etc. Second, although the various penalties can also be combined with other loss 
functions, in particular (negative) log-likelihoods of high-dimensional general- 
ized additive models, extensions of existing asymptotic results to these situations 
of practical interest are missing and quite challenging. Third, the assumptions 
on covariates rule out time trends f{t), which are useful in predictors for longi- 
tudinal data, or spatial effect /(s) in geoadditive regression models. 



3 Bayesian Function Selection 

3.1 Bayesian Sparseness and Smoothness Priors 

For Bayesian function selection, the first immediate idea might be to transfer 
the penalty-based approaches discussed so far into corresponding prior formu- 
lations. Indeed, if pen(/j) is a smoothness or sparsity penalty for a function, a 
corresponding prior can always be constructed via 

P{fj) « exp (-pen(/j)) 

although the resulting prior may be (partially) improper, i.e., it can not neces- 
sarily be normalized to integrate to one. By construction, large values of the 
penalty now correspond to a rather small prior "probability" such that a priori 



functions with small penalty values are favored. Since the penalties considered 
in function selection enforce sparsity and/or smoothness of functions, the cor- 
responding Bayesian priors also yield an a priori preference for sparse models 
or smooth function estimates. Quite often, the prior will contain additional 
parameters controlling in particular its variability and therefore governing the 
impact of the prior on the posterior. These variance parameters then provide 
the Bayesian counterpart to the frequentist penalty parameters. 

The reformulation of penalties as priors establishes an immediate connection 
between penalized maximum likelihood estimates and Bayesian posterior mode 
estimates. Since the posterior for a model comprising functions is 
given by 

,n X piy\fi,---,fp)pifi) - ■■■■Pifp) 
p{fi,---,fp\y) = 



p{y) 

(assuming a priori independence of the functions /i, • • . , /p so that the joint 
prior factorizes into the product of the individual priors) maximizing the pos- 
terior is indeed equivalent to maximizing the penalized log-likelihood 

K/i,---,/p)-Elog (p(/,)). 

Many commonly applied smoothness penalties are based on basis function 
representations of the nonparametric functions such that fj = XjfBj and a 
quadratic penalty 

Pen(/j) = pen(/3j) = Xj/3'jKjPj 

with positive semidefinite penalty matrix Kj. The corresponding Bayesian 
smoothness prior is then multivariate Gaussian with prior density 




piPjli) oc exp --^f3'jK,(3^ . (15) 



The penalty matrix Kj can be interpreted as a prior precision matrix inducing 



correlation properties. The prior variance r? can be linked to the smoothing 



parameter of the penalty via Xj = \tJ''^ and therefore has to be interpreted 
inversely to the smoothing parameter, i.e., large variances correspond to only 
moderate impact of the prior on the posterior, while small variances induce 
a large impact of the prior. In most cases, further hyperpriors are placed on 
the variance parameter to enable a data-driven choice of the prior impact, see 
for example Fahrmeir et al. [2010 or Fahrmeir and Kneib |2011|. The most 



common, conjugate choice would be an inverse gamma prior ~ IG{a, b) with 
a = b = 0.001 as a default option. 

The above construction principle does not only work with smoothness pri- 



ors but also with sparseness priors. For example, the Bayesian LASSO [Park 



and Casella 2008 has been introduced as a Bayesian counterpart to LASSO 



regression and relies on the prior specification 



p(/3,-|A)ocexp(-A|/3,-|) 



(16) 



for an individual regression coefficient. This Laplace prior is exactly the Bayesian 
analogue to the absolute value penalty considered in the frequentist framework. 
In fact, the equivalence between posterior modes and penalized maximum like- 
lihood allows to transfer basically all penalties discussed in the previous section 
to the Bayesian framework. However, Bayesian inference will usually rely on 
Markov chain Monte Carlo (MCMC) simulation techniques that require either 
closed form full conditionals or suitable proposal densities. These are typically 
difficult to obtain with non-standard priors comprising for example combinations 
of squared penalties and absolute value penalties or special penalties like the 
SCAD. The most convenient framework is obtained with conditionally Gaus- 
sian priors, i.e., hierarchical prior specifications where the prior for the basis 
coefficients is Gaussian given suitable hyperprior specifications. In fact, quite 
different marginal priors for the basis coefficients may arise given specific hyper- 
priors. Griffin and Brown [2005 and Poison and Scott |2012| contain overviews 



of some sparseness-inducing possibilities and their properties. For example, the 
Bayesian LASSO can be formulated as a scale mixture of Gaussian prior with 
exponential hyperprior on the variance of the Gaussian distribution in the fol- 
lowing hierarchy: 

/3,|A,r| ~ N{0,T^), t||A ~ ExpC^A^). 

After integrating out r?, the regression coefficient /3j has a marginal Laplace 
prior (16). By treating as additional unknown parameters arising from the 
scale mixture representation in conditionally Gaussian priors for the regression 
coefficients, efficient Gibbs samplers can be constructed if the observation model 
is also Gaussian or has a latent Gaussian representation (as for example in the 
case of probit models). In all other cases, iteratively weighted least squares 
proposals within a Metropolis Hastings algorithm are a suitable alternative, see 



Fahrmeir et al. 2010 for details. 

From a conceptual point of view, the equivalence of priors and penalties could 
now be used to implement Bayesian function selection by applying the sparsity 
and smoothness penalties from the previous section. For example, the Bayesian 
analogue to the group LASSO penalty ([?]) would induce the prior 

p{fij) oc exp {-X\\f3j\\K,) , Kj = x'jXj 



that would replace the Gaussian standard prior (15). However, the automatic 
selection properties of popular sparseness penalties are typically lost in the 
Bayesian framework when inference is based on MCMC. There are at least 
two reasons for this: First, the marginal posterior medians or means are easily 
obtained from MCMC output while the posterior modes are more difficult to 
obtain in a sampling-based approach. Unfortunately, posterior means and me- 
dians do not possess the selection properties of the mode unless the posterior 
is exactly symmetric around zero. Second, MCMC inference is sampling based 
and the inherent sampling variability will almost surely prevent the algorithm 
from estimating effects to be exactly equal to zero even if the posterior mean or 
median should in fact be zero. However, the Bayesian approach for example to 
LASSO regularisation also has a number of remarkable advantages: First of all, 



the Bayesian inferential scheme, unhke the frequentist optimisation approaches, 
does not only provide point estimates but the complete posterior distribution. 
As a consequence, estimation uncertainty is easily addressed and the selection 
of "relevant" effects can be based on suitable measures of this uncertainty. In 
addition, Bayesian smoothness and spareness priors can easily be combined with 
other complex predictor structures such as nonlinear or spatial effects due to 
the modular nature of MCMC algorithms. Finally, there is some empirical ev- 
idence that Bayesian sparseness priors can yield a considerable improvement 
in terms of prediction accuracy when being compared to standard prior struc- 



tures, see for example Kneib et al. 2011 , Konrath et al. 2012) who compare 



several prior structures for high-dimensional parametric predictor components 
in additive models and additive hazard regression. 



3.2 Indicator Selection Approaches 

Due to the difficulties with Bayesian sparseness priors discussed in the previous 
paragraph, Bayesian selection schemes based on indicator variables have been 
developed that explicitly include or exclude terms from a model. For example, 
model ([2]) could be replaced by 



r]i = ^0 + lifiixii) + ...+ -fpfp{xip) 

where the indicator 7^ G {0,1} can be used to include (7^ = 1) or remove 
(7j = 0) terms from the model. To further improve Bayesian function selection, 
most approaches discussed in the following use a decomposition of terms fj 
into a simple, parametric part and more complex deviations. For example, in 
the context of nonparametric regression where fj corresponds to the potentially 
nonlinear effect of a continuous covariate Xj, it makes sense to decompose the 
total effect into a linear part xj^qj and the deviation from the linear part fj{xj), 
i.e., 

fj{xj) = Xj/3oj + fj{xj). (17) 

Instead of putting one joint selection indicator on the overall function fj, one can 
then assign separate selection indicators to the linear effect and the nonlinear 
deviation. As a consequence, one can determine whether the effect of covariate 
Xj shall be included at all, whether it can be adequately described by a linear 
effect or whether more complex nonlinear modelling is indeed needed. While 
the decomposition is easily incorporated in the Bayesian selection framework by 
modifying the design matrices of the model before actually starting estimation, 
it is usually more difficult to incorporate in frequentist approaches relying on 
sparseness or smoothness penalties. 

In a Bayesian formulation, the selection indicators 7^- are then treated as 
additional unknowns that are assigned a Bernoulli prior jj ~ Be(7rj) where 
TTj = P(7j = 1) is the prior inclusion probability of term fj and additional prior 
hierarchies may be placed on TTj. Incorporating the indicators 7^ in an MCMC 
estimation scheme allows to estimate posterior inclusion probabilities -P(7j = 



l\y) for example by calculating the relative frequency of samples with 7^ = 1 
(although more efficient estimates can be obtained from Rao-Blackwellization, 
i.e. by averaging the full conditional probabilities that are used for proposing a 
new state of 7^ in the MCMC algorithm). Applying a threshold to the estimated 
inclusion probability then provides a means of function selection. 

If interest is not on one final model but on incorporating model uncertainty in 
the estimation of the functions, Bayesian model averaging can be implemented 
by averaging over all samples obtained from the different configurations of se- 
lection indicators. This in fact yields estimates that are weighted averages of 
the individual model estimates with weights proportional to the posterior model 
probabilities. 



Cottet et al. 2008 consider function selection in double exponential families 
where separate predictors can be placed on expectation and variance of the 
responses. Generalized additive models are contained as a special case and 
we will restrict our attention to this situation. Nonlinear effects of continuous 
covariates are modelled utilizing the RKHS representation of smoothing splines 
that exhibits a direct differentiation into a linear function part and nonlinear 



deviations as in (17) and yields a zero mean Gaussian smoothness prior for the 



nonlinear effects with covariances 

1 2^/ 1 



Cov{fj{xj),fj{x'j)) = TfC{xj,x'j) and C{x, x') = (^x' 



X 

3 



To make inference via MCMC numerically tractable, a low rank approxima- 
tion based on the spectral decomposition of the covariance matrix is employed, 
leading to 

fj=X,Pj, Pj^N{0,Tfl). (18) 



The very small dimension of /3j chosen by Cottet et al. 2008 in their anal- 
yses may also be related to mixing problems associated with larger blocks of 
coefficients (as detailed in the next section) but since no implementation of the 
approach has been made available by the authors, this conjecture can not be 
checked empirically. An inverse gamma prior is assigned to the smoothing vari- 
ances Tj of the nonlinear effects to include estimation of the required amount 
of smoothness. To be able to differentiate between the necessity of linear and 
nonlinear effects for specific covariates, each covariate is assigned two indicator 
variables: one for the linear part of the effect and one for a completely nonlinear 
model. By assuming a priori dependence between the indicators such that the 
indicator for the nonlinear part is only nonzero if the indicator for the linear 
effect is nonzero, the model is made identifiable. 



For the selection indicators, Cottet et al. |2008| consider i.i.d. Bernoulli priors 



-fj ~ B{7rj) with a uniform prior iTj ~ U[0, 1] for the inclusion probability. This 
corresponds to the special case of i.i.d. Beta(a, b) priors with a = b = 1. While 
a flat prior seems to make sense intuitively in representing the absence of prior 
knowledge on the model complexity, it in fact induces a binomial distribution 
with success probability 0.5 for the number of terms included in the model such 
that it favors models of intermediate size and has prior expectation of models 



of size p/2. As a consequence, more refined priors have been developed and will 
be discussed in more detail in the following in the context of specific proposals. 
Instead of cubic smoothing splines, Panagiotelis and Smith 2008) consider 



penalized splines in the spirit of Eilers and Marx 



1996 



for modelling non- 
parametric effects in a purely additive model. In order to achieve a separation 
between linear and nonlinear part of a function, they impose a constrained prior 
on the penalized splines that effectively removes the null space from the penalty. 
When using penalized splines with second order difference penalty, this means 
that a linear constraint is placed on the basis coefficients that restricts the linear 
part of the function to be zero. Then, a separate linear effect can be included in 
the model and an explicit differentiation is possible. The linear constraint in the 
prior for the penalized spline basis leads to similar constraints in the posterior 
but fortunately such a constraint can easily be incorporated when sampling from 



Gaussian full conditionals or proposal densities (see |Rue and Held 2005) for 
corresponding simulation algorithms). As a beneficial side effect, the constraint 
also allows to formulate the complete model in terms of proper priors which is a 
prerequisite for Bayesian function selection. Due to the explicit differentiation 



between linear and nonlinear effects, Panagiotelis and Smith 2008 can put sep- 



arate selection indicators on them without additional hierarchical requirements. 
As a prior for the selection indicators, they do not consider an i.i.d. Bernoulli 
prior with flat hyperprior but the prior 



1 



V 



-1 



where 7 denotes the complete vector of selection indicators, p is the number of 
model terms under selection and is the number of indicators that are non- 
zero, i.e., the number of terms to be included. This prior structure implies a 
uniform prior 



1 



p — 1 



on the number of model terms and therefore assigns equal probability to all 
possible model sizes. To circumvent sampling problems that have been present 



in previous function selection approaches such as Yau et al. 2003 , Panagiotelis 



and Smith |2008| propose a sampler where both the selection indicator and 
the basis coefficients are updated simultaneously. However, their proposal is 
currently limited to models with Gaussian responses and no implementation is 
publicly available. 

A related approach to function selection based on categorical instead of bi- 
nary selection indicators is proposed in Sabanes Bove et al. 12011 to directly 



incorporate prior information on the model space in the model formulation. For 
each nonlinear effect fj in the model, Sabanes Bove et al. 2011| define a dis- 
crete, finite list d = {0, 1, . . . ,K} of potential degrees of freedom comprising 
df = (no effect) and df = 1 (linear effect) as special cases. This can be seen 
as a categorical extension of the binary indicators that only allow for inclusion 
or exclusion of effects and model the variability of effects given the indicator 



in a separate step. Instead, Sabanes Bove et al. [2011 combine inclusion or 



exclusion and variability of the effect estimates in one joint indicator. 

The functions fj are specified in mixed model representation as in (18) and 
a g-type prior 

is specified for the "fixed effects" part of the model after marginalizing with 
respect to the "random effects" part of the resulting mixed model where g is 
a scaling factor that is assigned a further hyperprior and Jq is the prior pre- 
cision matrix obtained in the marginal mixed model representation. Posterior 
inference on the model space is then accomplished by MCMC sampling where 
special sampling types are defined to move along the model space. More specif- 
ically, they consider the sampling steps "Move", corresponding to a change in 
the degrees of freedom of a specific term from the current value to one of the 
neighboring values in the prespecified list, and "Swap", where a pair of terms is 
selected and the degrees of freedom for the two terms are interchanged. The lat- 
ter step is introduced to allow the model to account for collinearity by swapping 
terms associated with highly correlated covariates. 

The model space for a specific model instance can then be described by a 
p-dimensional vector d that contains the degrees of freedom for the various 
terms. Assuming for simplicity that K + 1 possible degrees of freedom have 
been prespecified for each term (including the special cases df = and df = 1), 
the prior on the model space is constructed so that all positive degrees of freedom 
are equally probable a priori for each included model term and so that the prior 
on the number of included model terms q is uniform. This leads to the prior 



p{d) 



that has the nice feature that for each term the prior inclusion probability (with 
any positive value for the degrees of freedom) is ^ while avoiding a preference 



for models of intermediate size. Sabanes Bove et al. 20111 also describe how to 



adapt the (/-prior approach to generalized additive models based on a Laplace ap- 
proximation. An implementation for both Gaussian and non-Gaussian responses 
with binomial or Poisson distribution is provided in the R-package hypergsplines 



Sabanes Bove, 2012 . The sampling algorithm is a two-step procedure that first 



marginalizes over the spline coefficients to sample only model configurations d, 
and then samples coefficients for models whose posterior support exceeds a cer- 
tain threshold. 



3.3 Spike and Slab Priors 

One common problem with the approaches discussed in the previous paragraph 
is that the discrete-continuous mixture of priors may lead to considerable diffi- 
culties when trying to achieve satisfactory sampling performance. Quite often, 
the mixing properties of the resulting chains are affected by the chain getting 
stuck in basins of attraction corresponding to specific selection configurations. 



This problem can sometimes be alleviated by considering marginalized sampling 
for the indicators after integrating out the corresponding regression coefficients 
or by joint updates but these approaches are mostly restricted to Gaussian re- 
sponses where - for example - marginalization can be implemented analytically. 

As a consequence, a second branch of Bayesian selection approaches replaces 
the point mass in zero corresponding to exact selection or exclusion of effects 
by a continuous approximation. In this case, the prior for an effect is a mixture 
of a very narrow component that basically reduces the effect size to zero (the 
so-called spike) and a usual standard prior that is spread out over the domain 
of the parameter space (the slab). For example, when selecting single regression 
coefficients Pj, such a prior may be 

/3, ~ (1 - 7r,-)A^(0, r2) + ^,-iV(0, rf) 

where Tq is chosen to be a very small constant such that the first part of the prior 
approximates a point mass in zero while an inverse gamma prior may be assigned 
to the variance of the second prior component rf. The approaches based on 
binary selection indicators discussed in the previous section can be considered 
as limiting cases with Tq — )• such that the spike part concentrates in a Dirac 
measure in zero. However, this limiting case often induces the same mixing 
problems and related necessity for marginalizing out the regression coefficients 
for sampling the indicator variable as discussed above. 

In the function selection framework, spike and slab priors are more conve- 
niently placed on the smoothing variances tJ where the spike could then be 
an inverse gamma distribution concentrated close to zero while the slab could 
be an inverse gamma prior with an additional hyper parameter. The underly- 
ing reasoning is that when setting the smoothing variance to zero, this induces 
maximal penalty and therefore also induces a reduced model. When decompos- 
ing the nonlinear effect into a linear one and nonlinear deviation, setting the 
smoothing variance to zero for example induces the deletion of the nonlinear 
deviation from the model. There are basically two advantages of placing the 
spike and slab prior on the smoothing variance: On the one hand, it avoids 
the need to define multivariate spikes and slabs that would be needed for the 
basis coefficient vectors and, on the other hand, the original prior structure can 
be kept for the basis coefficients and modifications are only required in higher 
levels of the prior hierarchy (and therefore also the resulting full conditionals). 



Scheipl et al. [2012 propose a very general spike and slab prior for function 
selection in structured additive regression models embedded in the exponential 
family regression framework where the predictor may not only comprise linear 



and nonlinear effects of continuous covariates as in (18) but also interaction 



surfaces, spatial effects or random effects. Similarly as in Cottet et al. 2008 



all these terms are parameterized in the form (18). They place the spike and 
slab structure on the smoothing variance tJ = Ijvj with v'j ~ IG{aj,bj) and 
the selection indicator following the two-component mixture 

7j ~ (1 - 7rj)6ug + TTj6u^ 

where 6iy denotes a point mass in u and < <^ ui are constants chosen to 
scale the variance Vj either to a very small value {uq close to zero) or a larger 



positive value (z/i considerably larger than i/q). To avoid mixing problems aris- 
ing with the direct application of the spike and slab prior structure to coefficient 
blocks of more than moderate size, Scheipl et al. 2012 propose a multiplicative 



parameter expansion where the basis coefficients /3,- are replaced by /3 • 



UJ 



with a scalar parameter ujj that represents the overall importance of the effect 
and a vector representing the distribution of the importance across the basis 
coefficients. The spike and slab prior structure is then placed on the scalar im- 
portance, i.e., LJj ~ N{0,'yjVj), while the components of are assumed to be 
i.i.d. from the mixture 0.5A^(1, 1) -|- 0.5A'^(— 1, 1). As a consequence, the dimen- 
sion of the parameter associated with the selection prior is always equal to one, 
which has beneficial impact on the mixing behaviour of the MCMC algorithm. 
Similar as in the previous approaches, Scheipl et al. 2012| apply a mixed model 
representation to all model terms that allows to separate for example linear 
effects and nonlinear deviations for continuous covariates and therefore allows 
to separately include or exclude linear effects and nonlinear deviations. Since 
the spike and slab prior is placed on the variances, usual updates for the re- 
gression coefficients can still be used (with some minor modifications due to the 
parameter expansion) such that the prior structure can readily be used with all 
exponential family response types. The approach is also implemented in the 



R-package spikeSlabGAM IScheipl 2011b 



Reich et al.| |200^ embed their Bayesian function selection approach in the 



context of reproducing kernel Hilbert spaces such that both the basis and the 
prior precision matrix arise from the reproducing kernel. They also consider 
indicators on the smoothing variance, or more precisely on the smoothing stan- 
dard deviations tj, in the limiting form with a point mass spike in zero. This 
leads to a type of zero-inflated prior distribution for the standard deviations 
with a point mass in zero and a continuous, positive part. For the latter, Reich 



et al. 12009 consider a half Cauchy distribution and propose hyperparameter 



choices that control the long run false positive rate of the algorithm. Sampling 
for the selection indicators is not performed in a separate step but - similar as in 
the proposal by Panagiotelis and Smith [2008 via a joint update of the indicator 
and standard deviation. Currently, the approach by Reich et al. 12009 is limited 



to Gaussian responses and also seems hard to extend beyond that setting. 



4 Empirical evaluation 



In the following Sections 4.1 4.2 and |4.3| we compare the performance of a 



number of approaches, all of them implemented either in R R Development 



Core Team 2011 or MATLAB MATLAB 2010 



these constitute the currently available implementations for function selection in 
additive models. ACOSSO 



Storlie et al. 



2011 



because earlier simulation studies in Scheipl 2011a showed it to be not compet- 

for which a rudimentary 



To the best of our knowledge, 
ations for function selection in 
was left out of this comparison 



Xue 



2009 



itive. We did not include the approach of 
MATLAB implementation is available, as the available code does not perform the 
essential smoothing parameter estimation. Our empirical evaluation therefore 
compares: 



avalos: Parsimonious additive models described in Avalos et al. [2007 



available as a collection of MATLAB scripts. Note that the MATLAB 



scripts of Avalos et al. were run under the GNU Octave system Eaton 



et al. , 2008 , which is known to be somewhat slower than MATLAB, so our 



timings for avalos are pessimistic. 



hgam: the high-dimensional additive model approach of Meier et al. |2009| , 



as implemented in R-package hgam Eugster et al. 2010 . Smoothing 



parameters for hgam are determined via five-fold cross validation, 
hypergsplines: objective Bayesian model selection with penalised splines 



as described in 


Sabanes 


Bove et al. 


2011| 


|Sabanes Bove 




2012 . 


We used S 


cubic 



designs and ran the sampler for the model indicators for 10000 iterations, 
followed by generating 2000 samples from the models representing 90% of 
the posterior mass explored by the model indicator sampling phase. 



mboost: componentwise gradient boosting as implemented in m boost [Hothbrn 
et al. 2012 . We supply separate base learners for the linear and smooth 



parts of covariate influence. We use 10-fold cross validation on the train- 
ing data to determine the optimal stopping iteration (with a maximum 
of 1000 iterations) and count a baselearner as included in the model if it 
is selected in at least half of the cross-validation runs up to the stopping 
iteration. 

mgcv: the double shrinkage approach for GAM estimation and term selec- 
tion described in iMarra and Wood 2011| , as implemented in R-package 



mgcv |Wood| |2012 



oraclemgcv: additive models as implemented in R-package mgcv based 
on the true model formula to serve as a reference for estimation accuracy. 



spam: the approach for sparse additive models by Ravikumar et al. |2009| , 
available as a collection of R-scripts. 

spikeSlabGAM: stochastic search term selection for GAMMs as described 
in Scheipl et al. 2012 and implemented in spikeSlabGAM [Scheipl,|2011b 



Results for spikeslabgam based on the default settings supplied in the 
package. 



• stepwise: Stepwise selection as described in 


Belitz and Lang 


2008 , im- 


plemented in BayesX jBelitz et al. 




2012 , accessed via R2BayesX jUmlauf 



et al. 2012 



For approaches returning estimated effective degrees of freedom (mgcv, spam), 
we count a covariate included as a linear effect if its estimated effective degrees of 
freedom are > 10~^ and as included as a non-linear effect if its estimated effective 
degrees of freedom are > 1. For approaches returning only the estimated effects 
(avalos, hgam), we cannot differentiate between linear and non-linear inclusion 
and count effects as included both non-linearly and linearly if the norm of the 



estimated effect vector is > 0. For approaclies returning estimated posterior 
inclusion probabilities (hypergsplines, spikeslabgam), we use probability 0.5 
as a threshold for inclusion. 

Table [T] summarizes the properties and capabilities of the compared imple- 
mentations. 



Algorithm 


Language 


GLM 


CIs 


Cat. 


Lin. 


Ran. 


Non-SeL 


d > n 


aval OS 


MATLAB 














V 


hgam 


R 














V 


hypergsplines 


R, C++ 














V 


mboost 


R, C 














V 


mgcv 


R, C 
















speun 


R 














V 


spikeslabgEun 


R, C 














V 


stepwise 


R, C++ 














V 



Table 1: Properties and capabilities of the compared implementations. 
"GLM" indicates whether function selection in models for non-Gaussian re- 
sponses (binary, Poisson, etc.) is implemented. "CIs" indicates whether any 
variability estimates for the effects are returned. "Cat." indicates whether 
estimation and selection of effects of categorical covariates is possible. "Lin." 
indicates whether the implementation explicitly differentiates between inclu- 
sions of linear and non-linear, smooth effects. "Ran." indicates whether the 
implementation allows for estimation and selection of random effects. "Non- 
Sel." indicates whether the implementation allows for a subset of covariate 
effects that is always included and not under selection, "d > n" indicates 
whether the implementation allows for coefficient vectors whose dimension d 
is bigger than the number of observations n. 

4.1 Simulation study 1: Additive model 

We simulate data from the following data generating process of a sparse additive 
model: 

• We define 4 functions 

— fi{x) = X, 
-/,(x) = x+(?S^, 

^ /3(x) = — j; + vr sin(7rx), 

— fiix) = 0.5x + 150(2(x — .2)) — (p{x + 0.4), where (/>() is the standard 
normal density function, 

which enter into the additive predictor. 

• We define 2 scenarios: 

— a "low sparsity" scenario: Generate 16 covariates, 12 of which have 
non-zero influence. The true additive predictor is 

V = fi{xi) + /2(x2) + fsixs) + U{X4)+ 



+ 1.5(/1(X5) + /2(X6) + fsixj) + h{x8)) + 
+ 2(/i(x9) + /2(xio) + Mxu) + U{X12)). 



— a "high sparsity" scenario: Generate 20 covariates, only 4 of which 
have non-zero influence and ry = /i(xi) + f2{x2) + fsixs) + /4(x4). 

The covariates are either 

— U[-2,2] or 

— from an AR(1) process with correlation p = 0.7. 

= + Ci, i = 1, . . . , n with '■~ ■ A^(0, cr^) 
number of observations n = 100, 1000 
signal to noise ratio ^^^''^ = 3, 10 

We simulate 50 replications for each combination of the various settings. 



This simulation study setup has previously been used in Scheipl 2011al|. Pre- 
dictive root mean square error (RMSE) is evaluated on test data sets with 5000 
observations. Selection performance, i.e. how well the different approaches select 
covariates with true influence on the response and remove covariates without 
true influence on the response is measured in terms of accuracy, defined as the 
number of correctly classified model terms (true positives and true negatives) 
divided by the total number of terms in the model. For example, the full model 
in the "low sparsity" scenario has 32 potential terms under selection (linear 
terms and basis expans ions/smooth terms for each of the 16 covariates), only 
21 of which are truly non-zero (the linear terms for the first 12 covariates plus 
the 9 basis expansions of the covariates not associated with the linear function 
/i(xi)). Accuracy in this scenario would then be determined as the sum of the 
correctly included model terms plus the correctly excluded model terms, divided 
by 32. 

Figure [T] shows relative RMSE for the different settings and algorithms. Rela- 
tive root mean square errors are always relative to the error of the "true" model 
estimated with mgcv (oraclemgcv). Note that results for stepwise for corre- 
lated covariates are omitted as they were orders of magnitude larger than the 
rest of the results. Figure [2] shows the accuracy of the variable selection perfor- 
mance in terms of the proportion of correctly excluded or included effects for 
the different settings and algorithms. Figure [3] shows computation times for the 
different algorithms and settings 

Most implementations were very stable, returning useful results for all 7200 
model fits. Notable exceptions were stepwise, which failed to return rea- 
sonable fits for any replication of the settings with correlated covariates, and 
hypergsplines, which did not return predictions for 37 [47] of the 50 replicates 
for = 100, SNR= 3 and low sparsity and correlation [0.7] and also failed in 
25 replications for A^ = 100, SNR= 3, low sparsity and correlation 0.7. Graphs 
and discussed results are based on all available results. Using only replicates for 
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Figure 1: Relative RMSE for the different settings and algorithms. Rows for 
the different combinations of sparsity and signal-to-noise ratio, columns for 
the different combinations of number of observations N and correlation p of 
the covariates. Vertical gray line denotes a relative RMSE of 1, i.e. parity with 
the mgcv-estimate of the "true" model. Note that results for stepwise are not 
shown for correlated responses since it did not return reasonable predictions. 
Also note that hypergsplines did not return results for some data sets and 
only available fits are shown here. 
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Figure 2: Proportion of correctly in- or excluded covariates for the different 
settings and algorithms. Rows for the different combinations of sparsity and 
signal-to-noise ratio, columns for the different combinations of number of ob- 
servations N and correlation p of the covariates. Dot dcnot(^s mean accuracy, 
lines stretch from minimum to maximum accuracy observed over the 50 repli- 
cations. Note that results for stepwise are not shown for correlated responses 
since it did not return reasonable predictions. Also note that hypergsplines 
did not return results for some data sets and only available fits are shown here. 
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Figure 3: Computation times. Boxplots are color-coded for the dimension- 
ality of the covariate vector, p = 20 corresponds to the sparse setting with 4 
no-zero and 16 zero effects and p = 16 corresponds to the non-sparse setting 
with 12 non-zero and 4 zero effects. Horizontal axis on logj^o'^^^ale. Note that 
results for stepwise do not include timings for correlated responses since it 
did not return sensible fits there. Also note that hypergsplines did not re- 
turn results for some data sets and only timings for the available fits are shown 
here. 



which hypergsplines also returned predictions would not change the results 
qualitatively. 

Results for relative RMSE (c.f. Figure [T])): 

• for small A^, low SNR (4 panels on top left), spikeslabgam and hgam 
consistently beat the error of the oraclemgcv benchmark - hypergsplines 
does too, but only for the sparse setting. 

• for small A^, high SNR (4 panels on bottom left), spikeslabgam, hgam 
and hypergsplines consistently beat the error of the oraclemgcv bench- 
mark, mboost and avalos also do well for the sparse setting. 

• for larger N (two rightmost columns), spikeslabgam, mgcv and, to a 
lesser extent, hypergsplines achieve performance very close to that of 
oraclemgcv. stepwise does too, but only for non- correlated responses, 
spam and mboost do (much) worse than oraclemgcv across the board for 
large A^. 

Results for selection accuracy (c.f. Figure [2])): 

• with the notable exception of mboost in high A^, high sparsity settings 
and stepwise, selection accuracy is robust against correlated covariates 
for all algorithms. 

• for A^ = 1000, spikeslabGAM, hypergsplines, and to a lesser extent 
spam as well have almost perfect accuracy across the board. 

• for A^ = 100, we see a clear split into algorithms that tend to fit very 
sparse models (spikeslabGAM, spam, hypergsplines) and consequently 
do well in the sparse settings and more liberal algorithms (mgcv, mboost , 
hgam) that do relatively better in the non-sparse settings. In the less noisy 
non-sparse settings, spikeslabGAM and hypergsplines also do well. 



Computation times are displayed in Figure [3j The times of avalos and hgam 
scale very badly for larger data set size, while computation times for other meth- 
ods that exhibit better or at least competitive selection and estimation perfor- 
mance (i.e., spikeslabGAM, hypergsplines, and, to a lesser extent, mboost) 
remain feasible. As hypergsplines first samples only the model indicators and 
then samples from a selection of models with strong posterior support to ar- 
rive at model-averaged effect estimates, it achieves a significant speed-up in the 
sparse case (p = 20) where the models to be sampled in the second step are 
very small. All other methods are slower for p = 20 than for p = 16. Note that 
timings for spikeslabGAM, hypergsplines and mboost are (very) pessimistic 
since they do not make use of the parallelization features available for these 
3 algorithms (parallel MCMC chains for spikeslabGAM, parallel computation 
via OpenMP for hypergsplines, parallel bootstrap iterations to determine the 
stopping iteration for mboost). FigureH^shows the distribution of ranks achieved 
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Figure 4: Rank distributions achieved by the 8 different algorithms for the 
AM simulation. Ties broken by assigning minimum rank to all involved. 



on each replicate of each setting by the 8 different algorithms in order to di- 
rectly compare relative performance in terms of RMSE, selection accuracy and 
required computation time, spam and stepwise are fast, but not competitive in 
terms of estimation or selection, spikeslabgam and hypergsplines deliver the 
best estimation and selection performance, with slightly more best performances 
in shorter time for hypergsplines, but more stable results for spikeslabgam. 
hgam often delivers good estimates, but it is very slow. The best compromise in 
terms of accurate estimation and computation time seems to be offered by mgcv, 
but the selection accuracy is below average. Both mboost and avalos show in- 
termediate results in all three performance dimensions we consider. Note that 
the threshold of > 10^^ effective degrees of freedom for inclusion of a term that 



we use for spam and mgcv is very low, leading to comparably less sparse models 
and more falsely included covariates than for most other methods, especially 
so for mgcv. In that sense, results for the selection accuracy of mgcv and, to a 
lesser extent, spam are somewhat pessimistic, especially in sparse settings. 

4.2 Simulation study 2: Additive model with concurvity 

To investigate the various methods' robustness to concurvity, we use a similar 
data generating process as the one used in the previous Subsection: 

• We define functions fi{x) to /4(x) as in the data generating process for 
the previous Subsection. 

• We use p = 10 or p = 20 covariates: The first 4 are associated with 
functions /i to /4, respectively, while the remainder are "noise" variables 
without contribution to the additive predictor. 

• covariates x are ~ f7[— 2,2] 

• we distinguish 3 scenarios of concurvity: 

— in scenario 1, X4 = c • g{xs) + {1 — c) ■ u, i.e., two covariates with real 
influence on the predictor are functionally related. 

— in scenario 2, X5 = c • g{x4^) + (1 — c) • n, i.e., a "noise" variable is a 
noisy version of a function of a covariate with direct influence. 

— in scenario 3, X4 = c • g{x^) + (1 — c) • -u, i.e., a covariate with direct 
influence is a noisy version of a "noise" variable. 

where 5(x) = 2$(x,/i = -1,0-2 = 0.16)+2$(x, = l,^^ = 0.09)-4(/)(x)- 
2 with o"^) defined as the cdf of the respective Gaussian distribu- 

tion, i. i. d. standard normal variates u, and the parameter c controlling 
the amount of concurvity: c = 1 for perfectly deterministic relationship, 
and c = for independence. In our simulation, c = 0, .2, .4, .6, .8, 1. 



• we use signal-to-noise ratio SNR= 1,3. 

• we use n = 100 observations for the training data set. 

• we simulate 50 replications for each combination of the various settings. 

Predictive MSE is evaluated on test data sets with 5000 observations. A similar 
simulation study setup has previously been used for comparing a different set of 



selection and estimation algorithms in Scheipl et al. 2012 . stepwise was omit- 



ted from this comparison due to its dismal performance for correlated covariates 
(see previous subsection). Note that mgcv did not return any fits for p = 20 as 
it cannot be used to estimate models with more coefficients than observations. 

Figure |5] shows relative predictive RMSE for y for the 7 algorithms and the 
three scenarios under different concurvity, number of potential predictors p and 
signal-to-noise ratio. 

Results for relative RMSE under concurvity (c.f. Figure [s]): 
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Figure 5: Boxplots of prediction RMSE divided by prediction RMSE of the 
"true" model fit by mgcv. Horizontal line denotes relative RMSE of 1, i.e. parity 
with the "oracle" model. Columns correspond to the different algorithms, rows 
correspond to the 4 combinations of number of potential predictors p and signal 
to noise ratio. Results for mgcv for p = 20 are missing as it cannot deal with 
models with more coefficients than observations. 



• Overall estimation accuracy of E{y) seems fairly robust against increasing 
concurvity and different number of noise terms for all the methods we con- 
sider here. Relevant increases in the relative error as concurvity increases 
are visible only for hypergsplines in Scenarios 1 and 2. 

• hypergsplines , mgcv and spikeslabgam achieve performance almost equiv- 
alent to or better than the "true" model estimates across the board. 

• mboost and hgam dominate the other methods and the "true" model esti- 
mates for higher signal-to-noise ratios (lower two lines in each figure), but 
are not competitive for more noisy data (upper two lines in each figure). 



4.3 Benchmark study on binary response data 

We investigated the performance of a subset of algorithms on four real data sets 



for binary classification taken from the UCI Machine Learning Repository iFrank 



and Asuncion 2010|: the "Musk (Version 1)" (musk), "Liver Disorders" (liver) 



"Connectionist Bench (Sonar, Mines vs. Rocks)" (sonar) and "Pima Indian 
Diabetes" (pima) datasets. These datasets were picked since they do not contain 
any categorical predictors for which none of the algorithms except mboost and 
spikeslabgam can perform selection. Of the implementations described in the 
previous section, only hgam, hypergsplines, mboost, mgcv and spikeslabgam 
can deal with non-Gaussian data, so our comparison is limited to those five 
approaches. 

Table |2] summarizes properties of the four datasets. Note that only musk, and. 
Name n p n/p Balance 



liver 345 7 49 1.4 

musk 467 167 2.8 1.3 

sonar 208 61 3.4 1.1 

pima 768 9 85 1.9 



Table 2: Properties of the four benchmark datasets. n denotes the number 
of observations, p denotes the number of potential covariates, "Balance" is the 
ratio of the number of cases in the larger class divided by the number of cases 
in the smaller class, i.e.. it is 1 if the data set has equal number of successes 
(ones) and failures (zeros). 

to a lesser extent, sonar, are truly high-dimensional variable selection problems 
and that only pima is somewhat unbalanced with a ratio of about 2:1 between 
cases in the bigger class and cases in the smaller class. 

For each of the four datasets, we generated ten bootstrap training data sets 
and evaluated predictive performance for the fitted additive logistic models in 
terms of minus twice the log likelihood (deviance) on the out-of-bag samples. 
To avoid extrapolation issues for the smooth terms, all training data sets con- 
tain those observations with the minimal and maximal values of the potential 
covariates for each data set. Figure [6] depicts the ratio between the predictive 
deviance of the fitted models and a simple logistic intercept model fitted on the 




Figure 6: Dotplots of relative predictive deviance on the test data sets. Rel- 
ative predictive deviance is defined as twice the negative log-likelihood evalu- 
ated on the test data sets divided by the predictive deviance of the intercept 
model on the training data, i.e., lower is better and values above 1 indicate 
predictions worse than those of a simple intercept model on average and indi- 
cate substantial overfitting. Results on the same test set connected by lines. 
Numbers in brackets after algorithm names indicate the number of times the 
algorithm failed to fit a model and/or return predictions. For the datasets in 
the right column, the full models have more coefficients than observations. 



training data, i.e., lower is better. Values above 1 indicate predictions worse 
than those of the simplest conceivable model on average and indicate substan- 
tial overfitting. Note that mgcv can not fit the musk and sonar datasets since 
the full model would have more coefficients than observations, hgam did not 
return any predictions for sonar due to numerical problems. For the remaining 
data sets, numbers in brackets after algorithm names indicate the number of 
times the algorithm failed to fit a model and/or return predictions. Results for 
the binary classification benchmark (c.f. Figures [6| [?]) show that: 

• mgcv and spikeslabgam have very similar predictive performances for the 
small data sets (liver, pima), but mgcv is orders of magnitude faster than 
all of the other algorithms. 

• mboost's performance is the most stable across replicates on each dataset 
and competitive to the best methods (mgcv and spikeslabgam). Its com- 
putation time is about the same as that of spikeslabgam for the small 
data sets and three to eight times faster for the high-dimensional problems. 

• spikeslabgam performs consistently well on all four data sets, but its 
computation time is much larger than that of its closest competitor on 
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Figure 7 : Boxplots of computation time for binary classification data sets in 
minutes. Time axis on binary log-scale. 



the small data sets (mgcv) and its closest competitor on the large data 
sets (mboost). 

• hgam's performance is also fairly stable across replicates for the small data 
sets, but much less competitive. For the high-dimensional data sets, it 
becomes very unstable and does not return predictions at all for seven out 
of ten replicates for musk and no predictions at all for sonar. 

• the performance for hypergsplines is similar but slightly worse than 
spikeslabgam for the smaller problems, but a lack of numerical stabil- 
ity of the implementation leads to a failure to return any predictions for 
five of the sonar test sets and six of the musk data sets. The remaining 
fits for the high-dimensional problems are not competitive, with relative 
predictive deviance values above 1 for the Sonar dataset indicating severe 
overfitting. 

• computation times for hypergsplines and hgam are much larger than 
those of the other algorithms which yield better predictions in shorter time. 
The time difference is two orders of magnitude for the small datasets. 



5 Conclusions 



Currently, function selection is mainly concentrated on additive predictors of 
scalar Gaussian or exponential family responses. Extensions to other regression 
models for scalar responses, such as survival models, quantile regression and 
other beyond mean regression models would be desirable but are still sparse or 
completely lacking. A wide and open field for future research is variable and 
function selection in multivariate regression and latent variable models, where 
the association structure may also depend on covariates. 
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